System and method of screening unstructured messages and communications

ABSTRACT

Embodiments of the present invention include a system and method of screening unstructured messages and communications. In one embodiment, messages and communications may be received in the form of email and telephone transcripts. In one embodiment, the present invention includes a method of extracting text from email and telephone transcripts and screening the content of the messages in order to pick out useful and relevant information using a list of words and phrases that can be described as industry recognized words and phrases. Industry recognized words and phrases are matched against the contents of the messages and communications to determine what part of the message or communication is relevant to an aspect of business.

CROSS REFERENCE TO RELATED APPLICATIONS

This invention claims the benefit of priority from U.S. Provisional Application No. 60/668,011 filed Apr. 4, 2005, entitled “System and Method of Screening Unstructured Messages and Communications”.

BACKGROUND

The present invention relates to unstructured data processing, and in particular, to systems and methods of screening unstructured messages and communications.

Unless otherwise indicated herein, the approaches described in this section are not necessarily all prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

The world of information technology can be divided into two environments: unstructured data and processing, and structured data and processing. The structured world is a world of databases, transactions, records, data layouts, reports and the like. Structured data processing consists of business transactions, usually involving money. For example, ATM activities, airline reservations, insurance premium processing, and inventory management are all standard forms of structured data and processing. The unstructured world is a world of spreadsheets, emails, telephone conversation transcripts, documents, and text. Unstructured data and processing are those activities (usually messages and communications) that occur inside the corporation and that are unbound by records, form, or content. An unstructured activity has no predetermined limitations on it.

It has been recognized that these worlds exist separate and apart. Technology either fits into one world or the other. There is very little crossover technology between the two worlds. But there are major opportunities waiting for technology that crosses the bridge between the structured world and the unstructured world.

For years unstructured data has been collecting and passing through organizations. The unstructured data takes the form of messages and communications. Typically, the sources of unstructured messages and communications are email and transcribed phone conversations. Once in a textual format, these messages and communications stay within the boundaries of unstructured data.

But there are great possibilities for exploitation if those messages and communications were to be intersected with structured data. Unfortunately, the lack of structure, the lack of format, and the lack of familiar and manageable content make it difficult, if not impossible, to blend structured data with the unstructured messages and communications. For example, the content of unstructured communications typically has no format, no structure, and no limitations. The message or communication can be long or short. The message can be in English, Russian, or any other language. The communication can be in sentences or prose. In short, there is no structure, format or limitation on unstructured communications. What is needed is a means of relating the two worlds.

The common link between the two worlds of structured data and unstructured data is text. But text is used so differently in the two environments that merely matching text causes even more confusion. In order to make sense of text that can be used for linking the worlds of structured data and unstructured data, it is necessary to be able to look at the unstructured messages and communications and pluck out of that environment the text that is meaningful to other environments, such as the structured environment.

The lack of structure found in messages and communications presents a profound barrier to the use of unstructured data (messages and communications) in the context of business. Because of the lack of structure, classical structured techniques of organizing and accessing data into transactions, records, and databases do not work. In order to start to use unstructured messages and communications in the structured world, some special processing must be done against the unstructured data to make the data fit for processing in the structured environment.

When it comes to messages and communications, merely placing them in the structured environment is wasteful and ineffective. When messages and communications are placed into the structured environment, there are several problems. First, messages and communications take up huge amounts of space. The amount of bulk consumed by messages and communications makes them expensive to handle and awkward to process. Second, many of the messages and communications are not relevant to the business or organization and typically such messages are not useful for making business decisions, yet they still take up space and must be handled. Additionally, most parts of the messages and communications that do relate to the business are not directly useful. Yet the entire message must be stored, which is wasteful and causes inefficient processing.

FIG. 1 shows how an organization has merely placed unstructured messages and communications in the structured environment. The result might be messages and communications in the structured environment such as the message depicted in 100 stored in database 110, wherein the pieces of information span the realms of both personal and business information. These messages and communications are hard to analyze or index, as these messages can be about anything. There may be massive amounts of data placed into the structured environment that have nothing to do with any aspect of business. About the only way to make sense of these messages is to read each message or communication in its entirety. Given that there may be many, many messages, such an approach is not practical.

Most of the messages and communications do not have anything to do with business. And for those messages and communications that do have something to do with business, the information is disorganized and difficult to find. To find something of importance requires a scan through all of the documents. When there are only 30 or 40 documents, such a scan is only a bother. But when there are tens of thousands or more documents, a manual scan becomes a truly arduous and impractical task.

Thus, what is needed is a method of screening unstructured business data in a way that will improve the efficiency, speed and quality of information available for making business decisions while also reducing the cost to store and process such data. The present invention solves these and other problems by providing an efficient information screening method that may be used to transfer unstructured messages and communications into the structured world.

SUMMARY

The present invention pertains to a method of screening unstructured messages and communications. Features and advantages of the present invention include separating useful information (e.g., for a business or enterprise) in messages and communications from unuseful information (i.e., blather). Embodiments of the present invention may determine which parts of the messages and communications are relevant to the business and classify the business relevant messages and communications as to what business subjects they are relevant to.

By analyzing messages and communications, the unnecessary blather can be discarded, and only the relevant business terms can be sent to the structured environment. This greatly reduces the need for storing unnecessary data in the structured environment and greatly speeds processing in that only relevant and useful terms are stored in the structured environment.

In one embodiment, text captured from email and telephone transcripts is screened and the content of the messages is categorized in order to pick out useful and relevant information using a list of words and phrases described as industry recognized words and phrases. The industry recognized words and phrases are matched against the contents of the messages and communications to determine what parts of the message or communication are relevant to an aspect of business.

In one embodiment, in order to make an industrial recognition approach work, it is necessary to have a list of industry used terms. There are industrial categories and within those categories there are terms that belong to those categories. Typical categories might be accounting, finance, human resources, compliance, ethics, and so forth.

In one embodiment, the words and phrases of each message and communication are passed through a screening program. The screening program looks at each word or phrase and attempts to match the word or phrase from the message or communication with the words and phrases found in the industrial lists. When a match is made, also called “a hit”, a record is written for the match.

In one embodiment, at the end of the screening process, messages and communications can be divided into one of two classes: useful and useless communications (e.g., relevant or irrelevant to a business).

In one embodiment, the business useful messages and communications can be further divided into different classes based on the relevance of the message or communication to industry categories. For example, a message can be deemed to be relevant to accounting and finance, but not to human resources and sales.

In one embodiment, once the messages and communications are screened, they can then be linked to structured data, or they can be further processed based on the results of the screening that has been done.

In one embodiment, the present invention includes a method of converting unstructured data into structured data comprising reading unstructured text based data, comparing said unstructured text based data against a predefined list of terms, and generating one or more structured records if a term in the text based data matches a term in the predefined list.

In one embodiment, the unstructured text based data comprises a plurality of text messages or communications, and the method further comprises automatically deleting a message or communication if a term in the predefined list does not match any term in the message or communication.

In one embodiment, the method further comprises storing the one or more records in a database.

In one embodiment, the text based data are a plurality of emails.

In one embodiment, the method further comprises converting audio to text based data.

In one embodiment, terms in the text based data are compared against each term in the predefined list.

In one embodiment, a match occurs if the term in the text based data is an exact match with the term in the predefined list.

In one embodiment, a match occurs if the term in the text based data is a stemmed match with the term in the predefined list.

In one embodiment, the predefined list includes categories.

In one embodiment, the method further comprises grouping records by categories in the predefined list.

In one embodiment, the predefined list includes subcategories.

In one embodiment, the method further comprises grouping records by subcategories in the predefined list.

In one embodiment, a record is generated for each match.

In one embodiment, one record is generated for a plurality of matches.

In one embodiment, the method further comprises associating at least one record with the text based data.

In one embodiment, the method further comprises associating at least one record with particular portions of text based data.

In one embodiment, the method further comprises storing at least one record and a link to the text based data in a database.

In one embodiment, the method further comprises calculating the relevance of the text based data. In one embodiment, calculating comprises counting the number of occurrences of a term from the predefined list in the text based data.

In one embodiment, the categories include finance, accounting, or sales.

In another embodiment, the present invention includes a method of converting unstructured data into structured data comprising reading a plurality of unstructured text messages or communications, comparing said plurality of unstructured text messages or communications against a predefined list of terms, generating a structured record if a term in a particular text message or communication matches a term in the predefined list, deleting the particular text message or communication if a term in the predefined list does not match any term in the particular text message or communication, and storing the records in a database. In one embodiment the predefined list includes categories of terms, and the method further comprises grouping the records by the categories in the predefined list.

In another embodiment, the method may include associating each generated record with the particular text message or communication.

These and other features of the present invention are detailed in the following drawings and related description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates how merely storing unstructured messages and communications in structured environments is wasteful and inefficient.

FIG. 2 illustrates how industrial recognition may be used for screening and organizing unstructured message and communication data according to one embodiment of the present invention.

FIG. 3 illustrates the general flow of the screening process according to one embodiment of the current invention.

FIG. 4 shows two typical configurations of the output from the screening process according to one embodiment of the present invention.

FIG. 5 shows a sampling of industry-recognized categories according to one embodiment of the current invention.

FIG. 6 shows that for an industrial category, words and phrases that are commonly used in that category are collected according to one embodiment of the current invention.

FIG. 7 illustrates the separation of useful from useless information according to one embodiment of the current invention.

FIG. 8 illustrates an alternative way of looking at the effect of screening raw text according to one embodiment of the current invention.

FIG. 9 illustrates how, after the hits have been determined, the hits can be grouped according to one embodiment of the current invention.

FIG. 10 illustrates the overall screening process using industry recognized terms and words according to one embodiment of the current invention.

FIG. 11 illustrates an alternative use of the screening process according to one embodiment of the present invention.

DETAILED DESCRIPTION

Described herein are systems and methods of screening unstructured messages and communications. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention as defined by the claims may include some or all of the features in these examples alone or in combination with other features described below, and may further include modifications and equivalents of the features and concepts described herein.

Embodiments of the present invention allow unstructured messages and communications to be read and then to have meaningful terms (i.e., words and phrases) extracted out of the content of the text. In doing so, messages and communications can be sorted into two classifications: messages and communications that are not useful for business processing, sometimes called “blather,” and messages and communications that are useful for further processing in the context of business.

To make unstructured data useful for business purposes, it is necessary to separate messages and communications containing useful information from messages and communications with absolutely no useful information. Then, to save storage space and provide for efficient use of the information, the useful parts of the messages and communications should be filtered out and classified by the business subjects to which they are relevant. Currently there is no efficient and cost effective system for sorting data from unstructured data sources, which means there are huge banks of data unavailable for making business decisions.

FIG. 2 illustrates how industrial recognition may be used for screening and organizing unstructured message and communication data according to one embodiment of the present invention. In one embodiment, industrial recognition of terms (i.e., words and phrases) in the message or communications is used to extract useful information. This may be regarded as an “ontological” approach to the screening of messages and communications. In the example of FIG. 2, email 210 and audio, for example from a cell phone 211, are received, and the audio may be transcribed by audio to text component 212 (e.g., which may be hardware, software, or a combination of hardware and software) to generate unstructured information 213. It is to be understood that text messages or communications may be received from a variety of other sources including cell phone text messages, for example. According to one embodiment of the present invention, the unstructured data 213 may be processed by a program 250 that applies industrial recognition to the data to extract relevant information. Program 250 may generate a structured output that may be stored in database 251, for example. Industrial recognition is the process of applying information that is known to be relevant to the incoming data, and extracting relevant data based on the result. For example, an industrial recognition program may include a list of terms known to be relevant to a particular business. The relevant data may be extracted based on whether or not one or more of the terms in the list is included. It is to be understood that a variety of complex extraction procedures or algorithms may be used in this process. Generally, one aspect of this invention is the recognition that unstructured messages and communications may be transferred into the structured world by applying information known to be relevant to a particular business.

FIG. 3 illustrates the general flow of data and processing of terms (words or phrases). FIG. 3 shows that email 310 or phone messages 311 may be collected. The phone messages begin as audio messages and are converted into text by audio to text component 312. Once converted into text, the phone messages are collected along with the email messages. At this point both the email messages and the phone messages exist as unstructured raw text 330. The raw text is then passed through a screening program 350, which may be referred to generally as an industrial recognition screen (e.g., the “edit” screen shown in FIG. 3). The industrial recognition screen uses one or more predefined lists 360 of industry recognized terms (i.e., words or phrases) to screen the raw text. Each word or phrase in the raw text is passed against each word or phrase in the industry recognized lists. At the end of the screening process, a record 370 may be created for every “hit” that has occurred. A “hit” is made when there is a match between a word or phrase from the raw text and the same word or phrase from the industry recognized word list. Records 370 may, in turn, be stored in a database, and the database may be queried to access the records. Furthermore, as described in more detail below, the records may be associated with the unstructured data (e.g., a record may be associated with an email that resulted in creation of the record). For example, records 370 may be stored in a database with links to the text based data. Accordingly, accessing structured information and/or associated unstructured information may be done through the structured environment.

In one embodiment, a hit can be made on a literal word or a stemmed word. A literal match is an exact match. Take for example the literal word “moving”. A literal match looks for exactly “moving”. A stemmed match looks for a match between word stems. For example, suppose the raw text has the word “moving”. If the industry recognized list had the word “mover”, there would be a stemmed match because both “moving” and “mover” have the same word stem, “move”. In one embodiment, the matching done in the screening process shown in FIG. 3 can be done either literally or on a stemmed basis.
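The difference between a literal match and a stemmed match may be illustrated with a minimal Python sketch. The suffix-stripping routine below is a deliberately crude stand-in for a real stemming algorithm and is an assumption of this example, as are the names crude_stem and match_type; it reduces “moving” and “mover” to the shared stem “mov” rather than “move”, which is sufficient for comparison purposes.

```python
# Crude suffix list, checked in order; an illustrative assumption,
# not a production stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "e", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def match_type(text_word, list_term):
    """Return "literal", "stem", or None for a raw-text word and a list term."""
    if text_word.lower() == list_term.lower():
        return "literal"
    if crude_stem(text_word) == crude_stem(list_term):
        return "stem"
    return None
```

With this sketch, match_type("moving", "moving") reports a literal match, while match_type("moving", "mover") reports a stemmed match.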

In one embodiment, one or more lists of industry recognized words and phrases can be used in the screening process. For example, a screen may use lists such as an accounting list, a finance list, a sales list, and a human resources list.

In one embodiment, the same word may appear in more than one industry recognized list. For example, the word “account” may be found in the accounting list, the sales list, and the finance list.
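One way to handle a term that appears in several lists is to invert the lists into a lookup from term to the set of categories containing it. The following Python sketch assumes the lists are held as a dictionary keyed by category; the list contents other than “account” and the name build_term_index are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative lists; only the presence of "account" in all three
# categories is taken from the description above.
INDUSTRY_LISTS = {
    "accounting": ["account", "payable", "receivable"],
    "sales": ["account", "quota"],
    "finance": ["account", "interest"],
}


def build_term_index(lists):
    """Map each lower-cased term to the set of categories that list it."""
    index = defaultdict(set)
    for category, terms in lists.items():
        for term in terms:
            index[term.lower()].add(category)
    return dict(index)
```

A hit on “account” can then be recorded once per category returned by the index.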

The output record is simple. The output record may include a variety of different fields of data, including but not limited to, raw text identifier, raw text date, time, type of match, term matched, or an industry recognized category, for example. Each word or phrase in the industry recognized list may have a category. Typical categories include, but are not limited to, accounting, sales, engineering, and compliance, for example.

An example industry recognized list for accounting includes, but is not limited to, terms such as payable, receivable, amount due, due date, interest, chart of accounts, account name and activity date.

Output from the processing of raw messages and communications passing through the screen in FIG. 3 might be as follows:

email 1244098

email date—May 13, 2003

literal match

“amount due”

category accounting

In one embodiment, a hit will be generated for every occurrence of the hit word in a single email. In one embodiment, an output record would be produced every time a hit is made. In one embodiment, not only single words but also multi-word phrases can be processed. For example, the screen may look for single words (e.g., “payable”), phrases (e.g., “due date”) or various combinations thereof. There is no limitation on the size of the phrase or the number of words in the phrase.
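The screen and its output record may be sketched in Python as follows. This is a minimal illustration restricted to literal matching, assuming the predefined lists are a dictionary keyed by category; the names HitRecord and screen and the field names are assumptions of the example, while the accounting terms are those listed above. One record is written per occurrence, and single words and phrases are treated alike.

```python
import re
from dataclasses import dataclass


@dataclass
class HitRecord:
    text_id: str      # raw text identifier
    text_date: str    # raw text date
    match_type: str   # "literal" in this sketch
    term: str         # term matched
    category: str     # industry recognized category


INDUSTRY_LISTS = {
    "accounting": [
        "payable", "receivable", "amount due", "due date", "interest",
        "chart of accounts", "account name", "activity date",
    ],
}


def screen(text_id, text_date, raw_text, lists=INDUSTRY_LISTS):
    """Return one HitRecord for every occurrence of a listed term."""
    records = []
    for category, terms in lists.items():
        for term in terms:
            # Whole-word, case-insensitive match; works for phrases too.
            pattern = r"\b" + re.escape(term) + r"\b"
            for _ in re.finditer(pattern, raw_text, flags=re.IGNORECASE):
                records.append(
                    HitRecord(text_id, text_date, "literal", term, category)
                )
    return records
```

For instance, screening a hypothetical email body such as “Please confirm the amount due before Friday.” under identifier 1244098 would yield a single literal hit on “amount due” in the accounting category.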

The output from the screen can be physically configured in several ways. FIG. 4 shows two of the ways the output can be configured. In FIG. 4 it is shown that there are individual physical records 470 for each hit made by the screen. Alternatively, the data can be grouped in a single record 480. Record 480 in FIG. 4 shows a raw text document that results in multiple hits. The record for such a screening activity might look like the following:

Phone call: AJK776-198

Phone date: Mar. 14, 2005

Literal match: “the Jones account”

Category: accounting

Stem match: “transfer”

Category: sales

Literal match: “contingency sale”

Category: compliance

Stem match: “savings”

Category: sales

. . .

. . .

In one embodiment, the output is the same whether the records are created individually or whether the records are “batched” or grouped together.
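The two output configurations carry the same information, which may be illustrated by a short Python sketch that batches individual hit records into one grouped record per raw text. The dictionary layout, the field names, and the function name batch_records are assumptions made for the example.

```python
def batch_records(hit_records):
    """Group individual hit records into one record per raw text."""
    grouped = {}
    for hit in hit_records:
        key = (hit["source_id"], hit["source_date"])
        grouped.setdefault(key, []).append(
            {
                "match_type": hit["match_type"],
                "term": hit["term"],
                "category": hit["category"],
            }
        )
    return [
        {"source_id": source_id, "source_date": source_date, "hits": hits}
        for (source_id, source_date), hits in grouped.items()
    ]


# Individual records corresponding to the phone call example above.
individual = [
    {"source_id": "AJK776-198", "source_date": "Mar. 14, 2005",
     "match_type": "literal", "term": "the Jones account", "category": "accounting"},
    {"source_id": "AJK776-198", "source_date": "Mar. 14, 2005",
     "match_type": "stem", "term": "transfer", "category": "sales"},
    {"source_id": "AJK776-198", "source_date": "Mar. 14, 2005",
     "match_type": "literal", "term": "contingency sale", "category": "compliance"},
    {"source_id": "AJK776-198", "source_date": "Mar. 14, 2005",
     "match_type": "stem", "term": "savings", "category": "sales"},
]
batched = batch_records(individual)
```

Here the four individual records collapse into a single grouped record whose list of hits preserves the order in which the hits were made.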

FIG. 5 shows a sampling of the industry recognized categories. In one embodiment, within each category there may be subcategories. For example, for sales, there may be subcategories such as:

sales for ranching

sales for road moving equipment

sales for sausage makers

sales for high tech

sales for drafting and graphic design, and so forth

In one embodiment, for each industrial category there will be words that are found in that category, such as seen in FIG. 6. FIG. 6 illustrates that words and phrases that are commonly used in an industrial category may be collected.

Embodiments of the present invention may be used to screen raw text to determine which messages and communications are blather and which messages and communications have real or potential business value. Blather is a message or communication that has no business value based on the content of the text of the message or communication. FIG. 7 shows such a separation.

FIG. 7 shows that raw messages and text 730 that have no hits on their text when screened against the lists of industry recognized words and phrases are considered to be blather 731. For example, an email containing only the message:

“Let's do lunch”

has no business context in the normal sense. But the phone message:

“I found the record for the Jones account. It was for Mar. 23, 2002 and was for $3,087.26 and was written by Mary Hastings. I am going to forward the transcript of the transaction to you.”

will probably have real business value.

The screening program 750 would not pick up any words of interest in the email and would thus classify the email as blather. The screening program 750 may match up words and phrases from the phone conversation with words or phrases on a list 760, and may show that the phone conversation would have business value. In this case, the email would be considered to be blather and the phone conversation may be used to generate one or more records or categories of records 770.

In one embodiment, once blather has been identified, it can be removed (i.e., deleted) from the email or telephone conversation data set. The result is a much smaller set of messages and communications that is much easier to handle than the original set.
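The separation and removal of blather may be sketched in Python as a partition of the messages into those with at least one hit and those with none. The term list below is a hypothetical stand-in for list 760, and the names has_hit and remove_blather are assumptions of the example; matching is literal and case-insensitive for brevity.

```python
import re

# Hypothetical industry recognized terms standing in for list 760.
TERMS = ["account", "transaction"]


def has_hit(text, terms):
    """True if any term occurs as a whole word or phrase in the text."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )


def remove_blather(messages, terms):
    """Split messages into (kept, blather) by whether they have any hit."""
    kept, blather = [], []
    for message in messages:
        (kept if has_hit(message, terms) else blather).append(message)
    return kept, blather


messages = [
    "Let's do lunch",
    "I found the record for the Jones account. It was for Mar. 23, 2002 "
    "and was for $3,087.26 and was written by Mary Hastings. I am going "
    "to forward the transcript of the transaction to you.",
]
kept, blather = remove_blather(messages, TERMS)
```

Under these assumptions the lunch invitation lands in the blather list and may be deleted, while the phone message is kept for record generation.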

Another embodiment of screening raw text is shown by FIG. 8. In FIG. 8 it is seen that raw text 830 enters the screening program 850, that the screening program examines each word and phrase in the raw text, that hits are found, and that records 870 are generated. In this example, the records may be “assigned” to, or “associated with” the raw text or particular portions of the raw text. The hits that have been made can then be grouped, as seen in FIG. 9.

FIG. 9 shows that after the hits have been determined, the records can be grouped. In the case of the example in FIG. 9, most hits are from finance and one hit is from accounting. By merely adding up the hits, a primary assignment can be made for the raw text. It can be inferred that the raw message or communication had a serious business relevance to finance, a slight business relevance to accounting, and no business relevance to such categories as sales and engineering.
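Adding up the hits to reach a primary assignment may be sketched in Python as a simple tally of hit categories. The function name primary_assignment and the specific count of four finance hits are assumptions of the example; the description states only that most hits are from finance and one is from accounting.

```python
from collections import Counter


def primary_assignment(hit_categories):
    """Tally hits per category and return (primary category, counts).

    Ties are resolved in favor of the category encountered first,
    which is an arbitrary choice made for this sketch.
    """
    counts = Counter(hit_categories)
    if not counts:
        return None, counts
    primary, _ = counts.most_common(1)[0]
    return primary, counts


# Assumed tally: four finance hits and one accounting hit.
primary, counts = primary_assignment(
    ["finance", "finance", "accounting", "finance", "finance"]
)
```

Categories with no hits, such as sales and engineering, simply report a count of zero when looked up in the tally.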

The larger picture of the screening process using industry recognized terms and words (ontologies) is shown by FIG. 10.

In one embodiment, by using the screening process and the industry recognized words and phrases, the organization can separate messages and communications into different categories: blather, which is useless to the business, and business useful messages containing relevant words and phrases.

Another use of the screening process is shown in FIG. 11.

In one embodiment, after the raw text has been screened, the hits can be grouped by category or by message. Grouping by message may include grouping records by terms in the list, message type (e.g., email or audio), date, time, number of hits, etc. Grouping by category may include grouping by categories or subcategories, for example. Accordingly, the accounting organization can quickly and easily find all the messages and communications that are relevant to them, the finance people can find their messages and communications, and so forth.

In one embodiment, there is another use for the information gained in the screening process. That use is to not only tell what business subjects the message or communication is relevant to, but to calculate how relevant the message or communication is. For example, suppose it is found that a message or communication is relevant to both accounting and to finance. It is seen that there are thirteen references to accounting in the message or communication and only one reference to finance. From this it can be inferred that the message or communication is more relevant to accounting than to finance.

In one embodiment, it is useful to count the number of occurrences of a business relevant term in the message or communication. For example, suppose a message or communication has the word “account” occurring five times. Only one business reference term record need be written out. But the fact that the word or phrase occurred multiple times can also be recorded. When the calculation is made as to how relevant a message or communication is to a business subject, the number of occurrences of a word or phrase is factored in as well as the number of different words or phrases that were found in the message or communication.
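The description does not fix a formula for relevance, so the following Python sketch is only one possible reading: it writes a single record per distinct term with an occurrence count, then reports for each category both the total occurrences and the number of different terms, and ranks categories by that pair. The function names, the record layout, the ranking rule, and the sample terms “payable” and “dividend” are all assumptions.

```python
from collections import Counter, defaultdict


def summarize_hits(hits):
    """Collapse (term, category) hits into one record per distinct term."""
    term_counts = Counter(hits)
    return [
        {"term": term, "category": category, "occurrences": n}
        for (term, category), n in term_counts.items()
    ]


def relevance_by_category(term_records):
    """Total occurrences and number of distinct terms for each category."""
    totals = defaultdict(lambda: {"occurrences": 0, "distinct_terms": 0})
    for record in term_records:
        entry = totals[record["category"]]
        entry["occurrences"] += record["occurrences"]
        entry["distinct_terms"] += 1
    return dict(totals)


def rank_categories(relevance):
    """Order categories by occurrences, then by distinct terms (assumed rule)."""
    return sorted(
        relevance,
        key=lambda c: (relevance[c]["occurrences"], relevance[c]["distinct_terms"]),
        reverse=True,
    )


# "account" occurs five times; "payable" and "dividend" once each.
hits = [("account", "accounting")] * 5 + [
    ("payable", "accounting"),
    ("dividend", "finance"),
]
records = summarize_hits(hits)
relevance = relevance_by_category(records)
ranking = rank_categories(relevance)
```

In this example only one record is written for “account”, with its five occurrences recorded in the count, and accounting ranks ahead of finance.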

The above description illustrates various embodiments of the present invention along with examples of how aspects of the present invention may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present invention as defined by the following claims. For example, information retrieval methods according to the present invention may include some or all of the innovative features described above. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims.

1. A method of converting unstructured data into structured data comprising: reading unstructured text based data; comparing said unstructured text based data against a predefined list of terms; and generating one or more structured records if a term in the text based data matches a term in the predefined list.
2. The method of claim 1 wherein the unstructured text based data comprises a plurality of text messages or communications, and wherein the method further comprises automatically deleting a message or communication if a term in the predefined list does not match any term in the message or communication.
3. The method of claim 1 further comprising storing the one or more records in a database.
4. The method of claim 1 wherein the text based data are a plurality of emails.
5. The method of claim 1 further comprising converting audio to text based data.
6. The method of claim 1 wherein terms in the text based data are compared against each term in the predefined list.
7. The method of claim 1 wherein a match occurs if the term in the text based data is an exact match with the term in the predefined list.
8. The method of claim 1 wherein a match occurs if the term in the text based data is a stemmed match with the term in the predefined list.
9. The method of claim 1 wherein the predefined list includes one or more categories.
10. The method of claim 9 further comprising grouping records by categories in the predefined list.
11. The method of claim 9 wherein the predefined list includes one or more subcategories.
12. The method of claim 11 further comprising grouping records by subcategories in the predefined list.
13. The method of claim 1 wherein a record is generated for each match.
14. The method of claim 1 wherein one record is generated for a plurality of matches.
15. The method of claim 1 further comprising associating at least one record with the text based data.
16. The method of claim 15 further comprising associating at least one record with particular portions of text based data.
17. The method of claim 15 further comprising storing at least one record and a link to the text based data in a database.
18. The method of claim 1 further comprising calculating the relevance of the text based data.
19. The method of claim 18 wherein calculating comprises counting the number of occurrences of a term from the predefined list in the text based data.
20. A method of converting unstructured data into structured data comprising: reading a plurality of unstructured text messages or communications; comparing said plurality of unstructured text messages or communications against a predefined list of terms; generating a structured record if a term in a particular text message or communication matches a term in the predefined list, and deleting the particular text message or communication if a term in the predefined list does not match any term in the particular text message or communication; and storing the records in a database.
21. The method of claim 20 wherein the predefined list includes categories of terms, and wherein the method further comprises grouping the records by the categories in the predefined list.
22. The method of claim 20 further comprising associating each generated record with the particular text message or communication.
23. The method of claim 20 wherein the categories include finance, accounting, or sales.
24. The method of claim 20 further comprising calculating the relevance of the text based data by counting the number of occurrences of a term from the predefined list in the text based data.
25. The method of claim 20 wherein the text based data are a plurality of emails.
26. The method of claim 20 further comprising converting audio to text based data.